
[quantization] Introduce a script for LLM evaluation #467

Merged
mhs4670go merged 1 commit into Samsung:main from stamalakhov:lm_eval on Feb 9, 2026

Conversation

@stamalakhov (Contributor) commented on Feb 5, 2026:

This PR introduces an option to run many LLM-related tasks using the lm_eval package.

To use it, please make sure you have lm_eval installed:

pip install lm-eval

The PR makes it possible to rank quantization results not only by PPL degradation but also by accuracy on a list of benchmarks; a sketch of that comparison follows the results table below.

python tico/quantization/evaluation/script/llm_tasks_eval.py --model "HuggingFaceTB/SmolLM2-135M-Instruct"

Loading FP model …
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|██████████| 2376/2376 [00:27<00:00, 85.22it/s]
Running loglikelihood requests: 100%|██████████| 9501/9501 [06:03<00:00, 26.17it/s]
results of HuggingFaceTB/SmolLM2-135M-Instruct evaluation:
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_easy|      1|none  |     0|acc     |↑  |0.5400|±  |0.0102|
|        |       |none  |     0|acc_norm|↑  |0.4882|±  |0.0103|
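
As a hedged sketch of the ranking idea above (the names fp_results / q_results and the "acc,none" metric key are assumptions about lm_eval's result layout, not part of this PR):

# Hedged sketch: per-task accuracy drop between an FP baseline and a
# quantized run. Both dicts stand in for the "results" mapping that
# simple_evaluate returns; the "acc,none" key follows lm_eval's
# "<metric>,<filter>" convention (an assumption here).
fp_results = {"arc_easy": {"acc,none": 0.5400}}
q_results = {"arc_easy": {"acc,none": 0.5100}}

for task, metrics in fp_results.items():
    drop = metrics["acc,none"] - q_results[task]["acc,none"]
    print(f"{task}: accuracy drop = {drop:.4f}")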

Draft: #436
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

@stamalakhov stamalakhov requested review from a team and mhs4670go February 5, 2026 07:57
@stamalakhov stamalakhov self-assigned this Feb 5, 2026
@stamalakhov stamalakhov force-pushed the lm_eval branch 3 times, most recently from 35a94ad to 35b24da on February 5, 2026 08:51
) -> dict[str, Any]:
    model_to_evaluate = HFLM(model, "causal", tokenizer=tokenizer)
    tasks_list: list[str] = tasks.split(",")
    return evaluator.simple_evaluate(model_to_evaluate, tasks=tasks_list)["results"]
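
For context, a minimal sketch of how a function like this could be driven end to end, assuming lm_eval's HFLM wrapper and a Hugging Face model already loaded in memory (the name evaluate_tasks and the loading code are illustrative, not the PR's actual script):

from typing import Any

from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForCausalLM, AutoTokenizer

def evaluate_tasks(model, tokenizer, tasks: str) -> dict[str, Any]:
    # Wrap the in-memory HF model so lm_eval can query it.
    model_to_evaluate = HFLM(model, "causal", tokenizer=tokenizer)
    tasks_list: list[str] = tasks.split(",")
    return evaluator.simple_evaluate(model_to_evaluate, tasks=tasks_list)["results"]

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(evaluate_tasks(model, tokenizer, "arc_easy"))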
Contributor commented:

FYI, there is a visualization API.

import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(...)
print(make_table(results))
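
(Note, as far as lm_eval's utils go: make_table expects the full dict returned by simple_evaluate, reading its "results" key internally, so pass the whole return value rather than only the "results" sub-dict.)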

@stamalakhov (Author) replied:

Wow. Thank you! I didn't know that.

@stamalakhov (Author) replied:

@mhs4670go I'll update the script.

mhs4670go previously approved these changes on Feb 5, 2026

@mhs4670go (Contributor) left a comment:

LGTM

import argparse
from typing import Any

from lm_eval import evaluator
A reviewer commented:

Overall LGTM. But could you share your opinion, please: should lm_eval be added to the project's dependencies?

Contributor replied:

I don't think we would add a dependency on lm_eval. That is why this script lives in just the script folder and has no tests.

When we need external dependencies like transformers or lm_eval, it would be good to have dedicated scripts or workflows in the internal repo.
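
A common pattern for keeping such a dependency optional is a guarded import (a sketch, not code from this PR):

# Guarded import: fail with a clear message instead of a bare ImportError
# when the optional package is missing.
try:
    from lm_eval import evaluator
    from lm_eval.models.huggingface import HFLM
except ImportError as err:
    raise SystemExit(
        "This script requires the optional 'lm-eval' package. "
        "Install it with: pip install lm-eval"
    ) from err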

This PR introduces an option to run many LLM-related tasks using the `lm_eval` package.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
@mhs4670go (Contributor) left a comment:

LGTM

@mhs4670go mhs4670go merged commit 3436e0f into Samsung:main Feb 9, 2026
7 checks passed
@stamalakhov stamalakhov deleted the lm_eval branch February 9, 2026 04:43